Introduction

This project aims to build a model that helps explain which factors contribute to cancer deaths. To do this, we will look at a dataset containing county-level data for the 50 US states. For every county, the dataset includes variables such as cancer diagnosis rates, population information, income, age, education levels, family sizes, marriage rates, insurance coverage, and employment rates. We will draw on intuition and domain experience in the health field to find variables that can help explain the death rate.

This report will walk through the process that was taken to make an effective and explanatory model. To do this the dataset must first be analyzed. Then variables will be selected as predictors and used in an initial model. The model will be examined to determine what predictors should stay and what should be removed. The model will be checked to make sure it meets all assumptions and can be used to make statistical inferences. The model will be finalized and inferences will be conducted, assuming the model assumptions are met.

Step 1 - Dataset Analysis

Plotting the distribution of the Death rate variable

After looking at the distribution we can see most counties have death rates around 175 per 100,000 residents. Given the size of the dataset, the variable appears approximately normally distributed. Looking specifically, we see the average death rate below:

## [1] 178.6641

We can see the highest county death rate is over 300 per 100,000 residents and the lowest is around 60. We can also see a summary of the death rate for easier viewing.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.7   161.2   178.1   178.7   195.2   362.8
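For reproducibility, here is a sketch of the commands that could produce the plot and summaries above, assuming the data has been loaded into a data frame named df with the response column TARGET_deathRate (both names taken from the model calls later in the report):

```r
# Plot the distribution of the death rate and print its summaries.
# Assumes df has already been read in from the county-level dataset.
hist(df$TARGET_deathRate,
     breaks = 50,
     main = "Distribution of Death Rate",
     xlab = "Deaths per 100,000 residents")

mean(df$TARGET_deathRate)     # average death rate
summary(df$TARGET_deathRate)  # five-number summary plus the mean
```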

Variable selection:

Initially, I will explore the following variables in order to put them in to a model in future steps:

  • avgAnnCount - Mean number of reported cases of cancer diagnosed annually
  • IncidenceRate - Mean per capita (100,000) cancer diagnoses
  • MedianIncome - Median income per county
  • MedianAge - Median age of county residents
  • studyPerCap - Per capita number of cancer-related clinical trials per county
  • PctHS25_Over - Percent of county residents ages 25 and over highest education attained: high school diploma
  • PctPrivateCoverage - Percent of county residents with private health coverage
  • PctPublicCoverage - Percent of county residents with government-provided health coverage

These variables look to be the most promising. Having spent nearly a decade working in the insurance industry, with some of that time spent selling health insurance products, I expect these variables to provide a reasonable explanation of the death rate variable.

Let's look at six of the eight variables plotted against Death Rate.

Looking at all of the plots generated, it looks like three of the six will have some correlation with the death rate. We should examine them all closer to see if this is the case.

AvgAnnCount

There does not appear to be a positive or negative relationship; the death rate looks to be centered around 200 with values \(\pm\) 100 on the y-axis. There could be some outliers, which we will examine more closely later.

The statistical summary of the avgAnnCount follows:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    76.0   171.0   606.3   518.0 38150.0

Because avgAnnCount is so spread out, we should limit some of the values that could be outliers.
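A sketch of how the limited plot could be produced, assuming df is the analysis data frame; the cutoff of 2000 is an illustrative choice, not a value stated in the report:

```r
# Re-plot avgAnnCount vs. death rate with extreme counts excluded.
# The 2000 cutoff below is a hypothetical threshold for illustration.
df_limited <- subset(df, avgAnnCount < 2000)
plot(df_limited$avgAnnCount, df_limited$TARGET_deathRate,
     xlab = "avgAnnCount (limited)",
     ylab = "Death rate per 100,000")
```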

Looking at the limited plots, there is still no distinguishable trend indicating any correlation between avgAnnCount and Death Rate.

Next we look at the proportion and count of counties by their average annual count of cancer diagnoses.

Incidence Rate

There appears to be a strong positive relationship between IncidenceRate and Death Rate, with possibly a few outliers. Again, we will examine this more thoroughly later.

Statistical summary of the Incidence rate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   201.3   420.3   453.5   448.3   480.9  1206.9

There is a large spread between the smallest and largest rates reported.

Median Income

There looks to be a moderate negative relationship between Median Income and Death Rate, with few outliers, if any.

summary(df$medIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22640   38882   45207   47063   52492  125635

Again, there is a large disparity between the highest and lowest median incomes reported across the counties.

Median age

It is difficult to see any relationship between MedianAge and Death Rate with the values currently in MedianAge, as there are 20 or more that appear to be incorrect (median ages above 350 years).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.30   37.70   41.00   45.27   44.00  624.00

It would be safe to assume these are entry errors. We will leave them in the dataset for now, but filter them out for the purpose of looking for any trend between Death Rate and Median Age.

To get a better look at whether there is any trend once the outliers are excluded, we will restrict the plot to values less than 250:
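A sketch of the filtering step, assuming df is the analysis data frame:

```r
# Keep only plausible median ages for the exploratory plot;
# rows with MedianAge >= 250 are treated as data-entry errors.
df_age <- subset(df, MedianAge < 250)
plot(df_age$MedianAge, df_age$TARGET_deathRate,
     xlab = "Median age",
     ylab = "Death rate per 100,000")
```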

With a better look at the data, there does not appear to be a relationship between MedianAge and Death Rate.

Study Per Capita

There does not appear to be much of a trend here.

Looking at the statistical summary of studyPerCap:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00  155.40   83.65 9762.31

From the statistical summary, we can see that most of the values are centered around 0. We will need to zoom in on the plot to make sure there is not any trend.

After seeing how many observations appear at 0, it would be helpful to understand how many instances of 0 studies in a county there are:

The plot shows that almost 2,500 counties have few or no studies being done locally.

A small table shows the specific counts and proportions.

Consistent with the plot, the table confirms that the large majority of counties report zero cancer-related studies.
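A sketch of how the counts and proportions of zero-study counties could be tabulated, assuming df is the analysis data frame:

```r
# Flag counties with zero cancer-related studies per capita
zero_flag <- df$studyPerCap == 0
table(zero_flag)              # counts of zero vs. non-zero counties
prop.table(table(zero_flag))  # same breakdown as proportions
```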

Percent County Residents Whose Highest Education is High School

There appears to be a slight positive trend between the percentage of residents who only finished high school and the cancer death rate.

Analysis Summary

Of the six variables analyzed, three looked to have no correlation with our target variable, Death Rate. The selection process may have been a bit naive, as the variables were chosen not from analysis but from preconceived ideas. If this were repeated, better variables could likely be selected.

Step 2 - Creating the Initial Model

We will fit a linear model with deathRate as the target variable and the variables chosen previously as the predictors.
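The fit corresponding to the summary below can be sketched as follows (the object name model1 matches the caption, and the formula matches the Call in the output):

```r
# Fit the initial model with the eight chosen predictors
model1 <- lm(TARGET_deathRate ~ avgAnnCount + incidenceRate +
               medIncome + MedianAge + studyPerCap + PctHS25_Over +
               PctPrivateCoverage + PctPublicCoverage,
             data = df)
summary(model1)
```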

model1 summary

## 
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate + 
##     medIncome + MedianAge + studyPerCap + PctHS25_Over + PctPrivateCoverage + 
##     PctPublicCoverage, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.082  -11.968    0.441   11.655  140.421 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.122e+02  6.283e+00  17.855  < 2e-16 ***
## avgAnnCount        -8.341e-04  2.804e-04  -2.975 0.002957 ** 
## incidenceRate       2.346e-01  7.000e-03  33.515  < 2e-16 ***
## medIncome          -1.991e-04  5.483e-05  -3.630 0.000288 ***
## MedianAge          -7.115e-03  8.179e-03  -0.870 0.384405    
## studyPerCap        -7.880e-05  7.070e-04  -0.111 0.911259    
## PctHS25_Over        9.206e-01  6.427e-02  14.325  < 2e-16 ***
## PctPrivateCoverage -8.782e-01  5.767e-02 -15.227  < 2e-16 ***
## PctPublicCoverage  -1.106e-01  8.054e-02  -1.373 0.169935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.38 on 3038 degrees of freedom
## Multiple R-squared:  0.4623, Adjusted R-squared:  0.4609 
## F-statistic: 326.5 on 8 and 3038 DF,  p-value: < 2.2e-16
  • The following variables, using an alpha of .05, are statistically significant:

    • avgAnnCount
    • incidenceRate
    • medIncome
    • PctHS25_Over
    • PctPrivateCoverage
  • The initial model has an \(R^2\) value of .4623.

  • Five of the eight variables selected were statistically significant. The least significant variable was studyPerCap. I had assumed it might be among the most significant, since it could indicate counties where research was being conducted in response to high cancer rates, helping the model's predictions.

I also assumed the \(R^2\) value would be higher with 8 variables selected.

Step 3

In this step we will apply two different automated predictor-selection methods to the dataset:

After running these two methods we will examine what predictors each method removed and make a decision on what model to move forward with.


Based on MedianAge, studyPerCap, and PctPublicCoverage being statistically insignificant, I anticipate the automated selection procedures will remove them.

fastbw()
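fastbw() is provided by the rms package and operates on a model fit with ols() rather than lm(). A sketch of the call that could produce the output below, assuming the same formula as the initial model:

```r
library(rms)

# fastbw() requires an ols() fit rather than an lm() fit
model_ols <- ols(TARGET_deathRate ~ avgAnnCount + incidenceRate +
                   medIncome + MedianAge + studyPerCap + PctHS25_Over +
                   PctPrivateCoverage + PctPublicCoverage,
                 data = df)

# Fast backward elimination; by default factors are deleted
# based on AIC
fastbw(model_ols)
```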

Running the fastbw() method gives us the following output:

## 
##  Deleted           Chi-Sq d.f. P      Residual d.f. P      AIC   R2   
##  studyPerCap       0.01   1    0.9113 0.01     1    0.9113 -1.99 0.462
##  MedianAge         0.75   1    0.3858 0.76     2    0.6823 -3.24 0.462
##  PctPublicCoverage 2.07   1    0.1507 2.83     3    0.4186 -3.17 0.462
## 
## Approximate Estimates after Deleting Factors
## 
##                          Coef      S.E.  Wald Z         P
## Intercept           1.056e+02 4.262e+00  24.785 0.0000000
## avgAnnCount        -8.499e-04 2.797e-04  -3.039 0.0023734
## incidenceRate       2.334e-01 6.939e-03  33.633 0.0000000
## medIncome          -1.703e-04 5.075e-05  -3.356 0.0007901
## PctHS25_Over        9.010e-01 6.254e-02  14.407 0.0000000
## PctPrivateCoverage -8.455e-01 5.185e-02 -16.307 0.0000000
## 
## Factors in Final Model
## 
## [1] avgAnnCount        incidenceRate      medIncome          PctHS25_Over      
## [5] PctPrivateCoverage

The three predictors initially expected to be removed are indicated by fastbw() as having p-values above our chosen \(\alpha\). Approximate estimates and p-values for the remaining predictors are also shown.

Creating a model with suggested predictors from fastbw():

## 
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate + 
##     medIncome + PctHS25_Over + PctPrivateCoverage, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.655  -12.173    0.442   11.849  140.284 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.056e+02  4.262e+00  24.786  < 2e-16 ***
## avgAnnCount        -8.499e-04  2.796e-04  -3.039  0.00239 ** 
## incidenceRate       2.334e-01  6.939e-03  33.634  < 2e-16 ***
## medIncome          -1.703e-04  5.075e-05  -3.356  0.00080 ***
## PctHS25_Over        9.010e-01  6.254e-02  14.407  < 2e-16 ***
## PctPrivateCoverage -8.455e-01  5.185e-02 -16.308  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.38 on 3041 degrees of freedom
## Multiple R-squared:  0.4618, Adjusted R-squared:  0.4609 
## F-statistic: 521.8 on 5 and 3041 DF,  p-value: < 2.2e-16

The fastbw() model has an \(R^2\) value of .4618, just .0005 smaller than the original model, with three fewer variables. The originally selected model and the fastbw() model have identical \(R^2_a\) values.

AIC
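stepAIC() is in the MASS package. A sketch of the call that could produce the trace below, assuming model1 is the initial model from Step 2:

```r
library(MASS)

# Backward stepwise selection by AIC, starting from the full model
model_aic <- stepAIC(model1, direction = "backward")
summary(model_aic)
```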

Running the AIC method gives us the following output:

## Start:  AIC=18378.78
## TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome + 
##     MedianAge + studyPerCap + PctHS25_Over + PctPrivateCoverage + 
##     PctPublicCoverage
## 
##                      Df Sum of Sq     RSS   AIC
## - studyPerCap         1         5 1261448 18377
## - MedianAge           1       314 1261757 18378
## - PctPublicCoverage   1       782 1262225 18379
## <none>                            1261443 18379
## - avgAnnCount         1      3674 1265117 18386
## - medIncome           1      5472 1266915 18390
## - PctHS25_Over        1     85203 1346646 18576
## - PctPrivateCoverage  1     96280 1357723 18601
## - incidenceRate       1    466413 1727856 19335
## 
## Step:  AIC=18376.79
## TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome + 
##     MedianAge + PctHS25_Over + PctPrivateCoverage + PctPublicCoverage
## 
##                      Df Sum of Sq     RSS   AIC
## - MedianAge           1       312 1261760 18376
## - PctPublicCoverage   1       785 1262233 18377
## <none>                            1261448 18377
## - avgAnnCount         1      3700 1265148 18384
## - medIncome           1      5473 1266921 18388
## - PctHS25_Over        1     85978 1347426 18576
## - PctPrivateCoverage  1     97325 1358773 18601
## - incidenceRate       1    468296 1729744 19337
## 
## Step:  AIC=18375.54
## TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome + 
##     PctHS25_Over + PctPrivateCoverage + PctPublicCoverage
## 
##                      Df Sum of Sq     RSS   AIC
## <none>                            1261760 18376
## - PctPublicCoverage   1       857 1262618 18376
## - avgAnnCount         1      3663 1265423 18382
## - medIncome           1      5532 1267292 18387
## - PctHS25_Over        1     85879 1347639 18574
## - PctPrivateCoverage  1     97913 1359673 18601
## - incidenceRate       1    468174 1729934 19335

The stepAIC() selection removed studyPerCap in the first iteration and MedianAge in the second. On the third pass there was no variable whose removal would lower the AIC score further.

The AIC selection process removed two of the three variables I thought would be removed. The stepAIC() process did not remove the Pct Public Coverage variable.

Creating a new model with the AIC selection suggestions yields the following output:

## 
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate + 
##     medIncome + PctHS25_Over + PctPrivateCoverage + PctPublicCoverage, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.893  -12.011    0.467   11.702  140.486 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.123e+02  6.280e+00  17.876  < 2e-16 ***
## avgAnnCount        -8.315e-04  2.799e-04  -2.971 0.002995 ** 
## incidenceRate       2.345e-01  6.983e-03  33.586  < 2e-16 ***
## medIncome          -1.997e-04  5.470e-05  -3.651 0.000266 ***
## PctHS25_Over        9.207e-01  6.400e-02  14.384  < 2e-16 ***
## PctPrivateCoverage -8.807e-01  5.734e-02 -15.359  < 2e-16 ***
## PctPublicCoverage  -1.155e-01  8.033e-02  -1.437 0.150727    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.37 on 3040 degrees of freedom
## Multiple R-squared:  0.4621, Adjusted R-squared:  0.4611 
## F-statistic: 435.3 on 6 and 3040 DF,  p-value: < 2.2e-16

The stepAIC() model has an \(R^2\) value .0002 lower than the original model, but an \(R^2_a\) value .0002 higher.

The AIC selection did not remove PctPublicCoverage. It had a p-value of .17 in the original model, and with the other two predictors removed it dropped to .15, still three times the selected \(\alpha\) value. I had assumed everything that was statistically insignificant would also yield a large enough AIC reduction when removed from the model.

Moving forward, I don't believe leaving the PctPublicCoverage variable in the model adds much. I will use the fastbw()-selected model going forward, which has the following variables:

  • avgAnnCount
  • incidenceRate
  • medIncome
  • PctHS25_Over
  • PctPrivateCoverage

Step 4

To check the mathematical assumptions of the model we will perform diagnostics on the model chosen in Step Three, in this case the fastbw() selected model. Checking the assumptions relies on using the model’s residuals, or the difference between the observed, or actual, value and the value that the model predicts.

Heteroscedasticity

To check for heteroscedasticity within the model we will look at the residuals. Specifically, we will check whether the residuals show constant variance.

First we will conduct the Breusch-Pagan test, which tests the following hypotheses:

\(H_0: \text{homoscedasticity}\)

\(H_a: \text{heteroscedasticity}\)
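bptest() is in the lmtest package. A sketch of the call, assuming model_fastbw_selection is the model chosen in Step 3 (the object name appears in the output below):

```r
library(lmtest)

# Studentized Breusch-Pagan test on the selected model
bptest(model_fastbw_selection)
```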

## 
##  studentized Breusch-Pagan test
## 
## data:  model_fastbw_selection
## BP = 70.308, df = 5, p-value = 8.841e-14

Looking just at the bptest() result, we would reject the null and conclude the variance is not constant, meaning this model assumption is not met.

The other check for heteroscedasticity is plotting the model's fitted values against its residuals.

The fitted values vs. residuals plot looks roughly circular, with no real trends appearing. This contradicts the bptest() result and indicates constant variance is plausible. The reason is that the hypothesis test can be overly sensitive with large datasets, and 3,047 observations can certainly cause this. Thus, we can reasonably say that the model's assumption of constant variance is upheld.

Independence

To check for independence in the residuals we will also look at the residuals compared to the fitted values.

First we will conduct the Durbin-Watson test, which tests the following hypotheses:

\(H_0: \text{residuals are independent}\)

\(H_a: \text{residuals are not independent}\)
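dwtest() is also in the lmtest package. A sketch of the call, assuming model_fastbw_selection is the selected model:

```r
library(lmtest)

# Durbin-Watson test for lag-1 autocorrelation in the residuals
dwtest(model_fastbw_selection)
```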

## 
##  Durbin-Watson test
## 
## data:  model_fastbw_selection
## DW = 1.6812, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

The dwtest() p-value would lead us to reject the null and conclude the residuals are not independent.

  • Similar to the Breusch-Pagan Test, the statistical test can be influenced by large datasets.

We move to the plot.

As it is the same plot as before, we note there is no trend, indicating no correlation between the residuals and the fitted values. Thus we can reasonably say that the model's assumption of independence is upheld.

Normality

To check whether the residuals are normally distributed we will conduct the Shapiro-Wilk test, which tests the following hypotheses:

\(H_0: \text{residuals are normal}\)

\(H_a: \text{residuals are NOT normal}\)
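The Shapiro-Wilk test is in base R. A sketch of the call matching the output below:

```r
# Shapiro-Wilk normality test on the model's residuals
# (shapiro.test accepts samples of up to 5000 values,
#  so the 3047 residuals are within its limit)
shapiro.test(model_fastbw_selection$residuals)
```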

## 
##  Shapiro-Wilk normality test
## 
## data:  model_fastbw_selection$residuals
## W = 0.98365, p-value < 2.2e-16

The shapiro.test() p-value would indicate evidence to reject the null and conclude the residuals are not normal. Again, as with the statistical tests above, the large dataset is influencing the shapiro.test() result. We will turn to the Q-Q plot, which plots the model's residuals against a theoretical normal line.

The Q-Q plot does not show any gross deviations from normality in the residuals, so the model assumption should be upheld.

There is a linear association between x and y

To conduct this check, we will use the lagged residual plot to make sure there is no trend between the residuals and a lagged version of themselves.
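A sketch of how the lagged residual plot could be drawn, assuming model_fastbw_selection is the selected model:

```r
# Plot each residual against the next one (lag 1)
res <- residuals(model_fastbw_selection)
n <- length(res)
plot(res[-n], res[-1],
     xlab = "Residual i",
     ylab = "Residual i + 1")
```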

The lagged residual plot does not show any positive or negative trend, indicating there is no serial correlation. This supports the model assumptions.

Step 5

We need to check whether any particular observations in the dataset are influencing the model to the point where the fitted line is drawn toward those points. We will examine this by looking for outliers using standardized residuals, and by looking for influential points using a method called Cook's distance.

Standardized Residuals

To look at the standardized residuals we use the rstandard() function, which calculates the value for each residual in the model. Values with absolute value over 3 are considered outliers. We will put these values into a dataframe for easy manipulation.
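A sketch of both steps, computing the standardized residuals and then filtering to the flagged points, assuming model_fastbw_selection is the selected model:

```r
# Standardized residuals for every observation
std_res <- data.frame(std_resid = rstandard(model_fastbw_selection))
head(std_res)

# Indices and values of potential outliers (|value| > 3)
outlier_idx <- which(abs(std_res$std_resid) > 3)
outlier_idx
std_res$std_resid[outlier_idx]
```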

The resulting first 6 values of the standard residuals dataframe can be seen below:

We now filter to see if there are any values that would be considered outliers:

##  [1]  116  122  254  282  522  650  775  812  912 1048 1059 1221 1331 1366 1497
## [16] 1942 2066 2176 2549 2600 2637 2646 2659 2714 2727

There are 25 total points that would be considered outliers, or .82% of the total dataframe.

Checking the values

##  [1]  3.3981  3.5565  4.1182 -5.8087  3.1296 -3.1029  3.5457 -3.3658 -3.0171
## [10] -3.1095 -3.8121  6.8959 -3.0398  5.3300  3.4729 -3.3831 -3.1122  3.1660
## [19]  3.4358  3.2229  3.0294 -4.7967 -3.9332  4.3117  3.3617

Some of these have sizable standardized residual values, but we should check with Cook's distance whether any of them have leverage and affect the model.

Cooks Distance

Using Cook's distance we have two ways to determine whether a point is influential. First, if the value calculated for an observation is above 1, we would say it is influential.

Using the cooks.distance() function on the model, we create a vector of the Cook's distance values. Running that, plus a line of code to find the largest value and its index, gives:

##       282 
## 0.2661295

A rule of thumb for influence is a Cook's distance over 1. The largest of the distances is nowhere near 1.

As another check we can use the F-threshold, the 50th percentile of the F-distribution with degrees of freedom given by the number of model parameters (6, including the intercept) and the residual degrees of freedom (3041). The value can be seen below.

## [1] 0.891551

Now we can see if there are any cooks distances above the F-threshold:

## named integer(0)

Since there are no values above the 50th percentile threshold of the F Distribution, we can say that while there are some values that could be considered outliers, there are none that appear to have any influence/leverage and are affecting the model.

Step 6

We will investigate if a model transformation might correct the model if mathematical assumptions of the model were not met in Step Four.

The peak of the Box-Cox log-likelihood looks to be close to .75, which is not close enough to 1 to say the model would see no benefit from a transformation.

## [1] 0.7878788

The Box-Cox method suggests a value of .7879 for \(\lambda\).
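boxcox() is in the MASS package and accepts an lm fit directly. A sketch of how \(\lambda\) could be extracted; the 100-point grid is an assumption chosen to match the precision of the reported value:

```r
library(MASS)

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(model_fastbw_selection,
             lambda = seq(-2, 2, length.out = 100))

# Lambda that maximizes the log-likelihood
lambda <- bc$x[which.max(bc$y)]
lambda
```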

Now we create a model with the Box-Cox \(\lambda\) transformation applied to the response variable, deathRate. The model summary for the transformed model is below:

## 
## Call:
## lm(formula = (TARGET_deathRate)^lambda ~ avgAnnCount + incidenceRate + 
##     medIncome + PctHS25_Over + PctPrivateCoverage, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.226  -3.122   0.197   3.160  35.341 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.985e+01  1.119e+00  35.615  < 2e-16 ***
## avgAnnCount        -2.103e-04  7.343e-05  -2.863 0.004221 ** 
## incidenceRate       6.117e-02  1.822e-03  33.573  < 2e-16 ***
## medIncome          -4.568e-05  1.333e-05  -3.428 0.000616 ***
## PctHS25_Over        2.404e-01  1.642e-02  14.641  < 2e-16 ***
## PctPrivateCoverage -2.177e-01  1.361e-02 -15.994  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.351 on 3041 degrees of freedom
## Multiple R-squared:  0.4607, Adjusted R-squared:  0.4599 
## F-statistic: 519.7 on 5 and 3041 DF,  p-value: < 2.2e-16

The Box-Cox transformed model has slightly lower \(R^2\) and \(R^2_a\) values.

Now we plot the Box-Cox transformed model's fitted values against its residuals to see if there was any improvement.

There does not seem to be any change in the diagnostic plot between the original fastbw-selected model and the transformed model. This supports the conclusion that the model already met the mathematical assumptions of a linear model.

Step 7

We will report the final model and use it to perform inferences.

## 
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate + 
##     medIncome + PctHS25_Over + PctPrivateCoverage, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.655  -12.173    0.442   11.849  140.284 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.056e+02  4.262e+00  24.786  < 2e-16 ***
## avgAnnCount        -8.499e-04  2.796e-04  -3.039  0.00239 ** 
## incidenceRate       2.334e-01  6.939e-03  33.634  < 2e-16 ***
## medIncome          -1.703e-04  5.075e-05  -3.356  0.00080 ***
## PctHS25_Over        9.010e-01  6.254e-02  14.407  < 2e-16 ***
## PctPrivateCoverage -8.455e-01  5.185e-02 -16.308  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.38 on 3041 degrees of freedom
## Multiple R-squared:  0.4618, Adjusted R-squared:  0.4609 
## F-statistic: 521.8 on 5 and 3041 DF,  p-value: < 2.2e-16
Now we will look at the parameter estimates and p-values for the final model:

                       TARGET_deathRate
  Predictors           Estimates       p
  (Intercept)          105.63          <0.001
  avgAnnCount          -0.00085        0.002
  incidenceRate        0.23            <0.001
  medIncome            -0.00017        0.001
  PctHS25_Over         0.90            <0.001
  PctPrivateCoverage   -0.85           <0.001
  Observations         3047
  R2 / R2 adjusted     0.462 / 0.461
## [1] 0.461769

We will compute and report a 95% confidence interval for the slope of the predictor we believe is most important: PctHS25_Over.
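A sketch of the call, assuming model_fastbw_selection is the final model:

```r
# 95% confidence interval for the PctHS25_Over slope
confint(model_fastbw_selection, "PctHS25_Over", level = 0.95)
```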

## [1] 0.778375 1.023625

We are 95% confident that the slope for PctHS25_Over is between .778375 and 1.023625.

We compute the confidence interval at the median of each of the model's predictors:

  • avgAnnCount = 171
  • incidenceRate = 453.55
  • medIncome = 45207
  • PctHS25_Over = 35.3
  • PctPrivateCoverage = 65.1
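A sketch of the mean-response interval at the predictor medians, assuming model_fastbw_selection is the final model:

```r
# Confidence interval for the mean response at the median
# of each predictor
median_obs <- data.frame(avgAnnCount = 171,
                         incidenceRate = 453.55,
                         medIncome = 45207,
                         PctHS25_Over = 35.3,
                         PctPrivateCoverage = 65.1)
predict(model_fastbw_selection, newdata = median_obs,
        interval = "confidence", level = 0.95)
```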

##        fit      lwr      upr
## 1 180.3991 179.6135 181.1846

We are 95% confident that the mean death rate for counties with the median avgAnnCount of 171, median incidenceRate of 453.55, median medIncome of 45207, median PctHS25_Over of 35.3, and median PctPrivateCoverage of 65.1 lies between 179.61 and 181.18.

We will use the following observation:
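A sketch of the observation and prediction-interval call, assuming model_fastbw_selection is the final model; the predictor values are those interpreted later in this section:

```r
# Hypothetical county used for the prediction interval
new_obs <- data.frame(avgAnnCount = 155,
                      incidenceRate = 467.1,
                      medIncome = 39303,
                      PctHS25_Over = 39.8,
                      PctPrivateCoverage = 59.8)
predict(model_fastbw_selection, newdata = new_obs,
        interval = "prediction", level = 0.95)
```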

The prediction interval is as follows:

##        fit      lwr      upr
## 1 193.1166 153.1524 233.0809

We are 95% confident that the death rate of a county with an avgAnnCount of 155, an incidenceRate of 467.1, a medIncome of 39303, a PctHS25_Over of 39.8, and a PctPrivateCoverage of 59.8 will lie between 153.15 and 233.08.

Conclusion

While the initial variable selection had some issues, the final selected model was able to explain 46.18% of the variation in the response variable, Death Rate. The final model equation was: \[\hat{y} = 105.63 - 8.499\times10^{-4}x_{avgAnnCount} + 0.2334x_{incidenceRate} - 1.703\times10^{-4}x_{medIncome} + 0.901x_{PctHS25Over} - 0.8455x_{PctPrivateCoverage}\]

The most influential variables relate to education and insurance. Specifically, for every 1 percentage point increase in the share of county residents whose highest education level is high school, we expect the county's cancer death rate to increase by .9 per 100,000 residents. For private insurance coverage, every 1 percentage point increase in coverage is expected to decrease the county's cancer death rate by .85 per 100,000 residents.

While these are useful insights, it would be worthwhile to investigate the other variables in the dataset more thoroughly. There were 22 other variables that could lead to a model that better explains the factors behind cancer death rates.